{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9F8XakDB1EBF"
      },
      "source": [
        "# LoRA on FLAN-T5 Small with the HF's Transformers Library\n",
        "This notebook is a companion of chapter 2 of the \"Domain Specific LLMs in Action\" book, author Guglielmo Iozzia, [Manning Publications](https://www.manning.com/), 2025.  \n",
        "The code in this notebook is to introduce readers to a PEFT (Parameter-Efficient Fine-Tuning) technique called [LoRA](https://arxiv.org/abs/2106.09685) (Low Ranking Adaptation). The pre-trained LLM model used as baseline is [FLAN-T5 small](https://huggingface.co/google/flan-t5-small) loaded through the Hugging Face's [Transformers library](https://github.com/huggingface/transformers). It is going to be tuned for text summarization on a subset of the [samsum](https://huggingface.co/datasets/samsum) dataset. Code execution requires a Colab free VM with hardware acceleration (GPU).  \n",
        "More details about this code example can be found in the book's chapter."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nUOuunCJ4exK"
      },
      "source": [
        "Install the missing requirements in the Colab VM."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Y-C-a5MI4je9"
      },
      "outputs": [],
      "source": [
        "!pip install datasets peft accelerate bitsandbytes evaluate rouge_score py7zr"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Kj2sOim-PsGJ"
      },
      "outputs": [],
      "source": [
        "import locale\n",
        "\n",
        "original_getpreferredencoding = locale.getpreferredencoding\n",
        "\n",
        "def getpreferredencoding_wrapper(do_raise=True):\n",
        "  return original_getpreferredencoding()\n",
        "\n",
        "locale.getpreferredencoding = getpreferredencoding_wrapper"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HX-dTMQT7okX"
      },
      "source": [
        "### Data Preparation"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5wDpxFy24xJv"
      },
      "source": [
        "Load the **sansum** dataset from the HF Hub."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6gYaE3eXdSHV"
      },
      "outputs": [],
      "source": [
        "from datasets import load_dataset\n",
        "\n",
        "dataset = load_dataset(\"knkarthick/samsum\", trust_remote_code=True)\n",
        "\n",
        "print(f\"Train dataset size: {len(dataset['train'])}\")\n",
        "print(f\"Test dataset size: {len(dataset['test'])}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "521yXget4-8e"
      },
      "source": [
        "Load the FLAN-T5 small model tokenizer from the HF Hub."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "V4OgPb4mdXxj"
      },
      "outputs": [],
      "source": [
        "from transformers import AutoTokenizer, AutoModelForSeq2SeqLM\n",
        "\n",
        "model_id=\"google/flan-t5-small\"\n",
        "\n",
        "tokenizer = AutoTokenizer.from_pretrained(model_id)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7WYc1sLIeh5R"
      },
      "source": [
        "Some preprocessing of the training/test data is needed.  \n",
        "We need to truncate training and test sequences that are longer than the maximum input sequence after tokenization and pad those that are shorter. This applies to both input and target.  \n",
        "For the input, we take the 85 percentile of the max length for better utilization."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "N5rqNJbBejzg"
      },
      "outputs": [],
      "source": [
        "from datasets import concatenate_datasets\n",
        "import numpy as np\n",
        "\n",
        "combined_dataset = concatenate_datasets([dataset[\"train\"], dataset[\"test\"]])\n",
        "\n",
        "tokenized_inputs = []\n",
        "for example in combined_dataset:\n",
        "    if example[\"dialogue\"] is not None:\n",
        "        tokenized_input = tokenizer(text=example[\"dialogue\"], truncation=True)\n",
        "        tokenized_inputs.append(tokenized_input[\"input_ids\"])\n",
        "\n",
        "input_lenghts = [len(x) for x in tokenized_inputs]\n",
        "max_source_length = int(np.percentile(input_lenghts, 85))\n",
        "print(f\"Max source length: {max_source_length}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HkgOCX316cRX"
      },
      "source": [
        "For the target, we take the 90 percentile of the max length for better utilization."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "NE0KYBeW6c3J"
      },
      "outputs": [],
      "source": [
        "tokenized_targets = concatenate_datasets(\n",
        "    [dataset[\"train\"], dataset[\"test\"]]).map(\n",
        "        lambda x: tokenizer(x[\"summary\"], truncation=True), batched=True,\n",
        "        remove_columns=[\"dialogue\", \"summary\"])\n",
        "target_lenghts = [len(x) for x in tokenized_targets[\"input_ids\"]]\n",
        "max_target_length = int(np.percentile(target_lenghts, 90))\n",
        "print(f\"Max target length: {max_target_length}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Zs-zC9N5_EkF"
      },
      "source": [
        "We can now define a single function that executes all the preprocessing steps (input tokenization, truncation and padding)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "npL8eEHUezui"
      },
      "outputs": [],
      "source": [
        "def preprocess_function(sample, padding=\"max_length\"):\n",
        "    # Filter out examples where dialogue is None and keep corresponding summaries\n",
        "    processed_samples = [(dialogue, summary) for dialogue, summary in zip(sample[\"dialogue\"], sample[\"summary\"]) if dialogue is not None]\n",
        "\n",
        "    inputs = [\"summarize: \" + dialogue for dialogue, summary in processed_samples]\n",
        "    labels_text = [summary for dialogue, summary in processed_samples]\n",
        "\n",
        "    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)\n",
        "\n",
        "    labels = tokenizer(text_target=labels_text, max_length=max_target_length, padding=padding, truncation=True)\n",
        "\n",
        "    if padding == \"max_length\":\n",
        "        labels[\"input_ids\"] = [\n",
        "            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels[\"input_ids\"]\n",
        "        ]\n",
        "\n",
        "    model_inputs[\"labels\"] = labels[\"input_ids\"]\n",
        "    return model_inputs"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7bfypwiXAnCv"
      },
      "source": [
        "Apply the function defined at the previous cell to the tokenized dataset."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "U0V-hrRnAlb4"
      },
      "outputs": [],
      "source": [
        "tokenized_dataset = dataset.map(preprocess_function, batched=True,\n",
        "                                remove_columns=[\"dialogue\", \"summary\", \"id\"])\n",
        "print(f\"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Faacsmv1A3Zm"
      },
      "source": [
        "Save the preprocessed datasets to disk to reuse them later."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "9cfgZ_WyA3GD"
      },
      "outputs": [],
      "source": [
        "tokenized_dataset[\"train\"].save_to_disk(\"data/train\")\n",
        "tokenized_dataset[\"test\"].save_to_disk(\"data/eval\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SXWXLIUqe-Lj"
      },
      "source": [
        "### Fine tuning with LoRA and [bitsandbytes](https://github.com/TimDettmers/bitsandbytes#) int8."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GK32DcRR9Axp"
      },
      "source": [
        "Load the FLAN-T5 small model in 8-bit precision from the HF's Hub."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "RNWrA_oMe5gZ"
      },
      "outputs": [],
      "source": [
        "from transformers import AutoModelForSeq2SeqLM\n",
        "\n",
        "model = AutoModelForSeq2SeqLM.from_pretrained(model_id, load_in_8bit=True,\n",
        "                                              device_map=\"auto\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4CaMie-K9lHf"
      },
      "source": [
        "Define the LoRA configuration, prepare the model for training, and add the LoRA adaptor to it."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "AUxXEEpLfo9T"
      },
      "outputs": [],
      "source": [
        "from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType\n",
        "\n",
        "lora_config = LoraConfig(\n",
        " r=16,\n",
        " lora_alpha=32,\n",
        " target_modules=[\"q\", \"v\"],\n",
        " lora_dropout=0.05,\n",
        " bias=\"none\",\n",
        " task_type=TaskType.SEQ_2_SEQ_LM\n",
        ")\n",
        "\n",
        "model = prepare_model_for_kbit_training(model)\n",
        "\n",
        "model = get_peft_model(model, lora_config)\n",
        "model.print_trainable_parameters()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kmu9xkLZ-93-"
      },
      "source": [
        "At the end of the execution of the code cell above, the number of parameters to train should be < 1% of the total for the model.  \n",
        "The training process is the same as regular LLM training, the main difference is in model to be trained, which is the one after submission to LoRA.  \n",
        "Define a Data Collator for this training:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "STIyxRkGfqt0"
      },
      "outputs": [],
      "source": [
        "from transformers import DataCollatorForSeq2Seq\n",
        "\n",
        "label_pad_token_id = -100\n",
        "data_collator = DataCollatorForSeq2Seq(\n",
        "    tokenizer,\n",
        "    model=model,\n",
        "    label_pad_token_id=label_pad_token_id,\n",
        "    pad_to_multiple_of=8\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YGCxnIAaAF2y"
      },
      "source": [
        "Set the training arguments and use them to create a Trainer instance. For this use case training for 3 epochs should be enough.  \n",
        "Model warnings have been silenced to make the output at training time less verbose and more readable."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "C55PvPoHfyWq"
      },
      "outputs": [],
      "source": [
        "from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments\n",
        "\n",
        "output_dir=\"lora-flan-t5-small\"\n",
        "\n",
        "training_args = Seq2SeqTrainingArguments(\n",
        "    output_dir=output_dir,\n",
        "\tauto_find_batch_size=True,\n",
        "    learning_rate=1e-3,\n",
        "    num_train_epochs=3,\n",
        "    logging_dir=f\"{output_dir}/logs\",\n",
        "    logging_strategy=\"steps\",\n",
        "    logging_steps=500,\n",
        "    save_strategy=\"no\",\n",
        "    report_to=\"tensorboard\",\n",
        ")\n",
        "\n",
        "trainer = Seq2SeqTrainer(\n",
        "    model=model,\n",
        "    args=training_args,\n",
        "    data_collator=data_collator,\n",
        "    train_dataset=tokenized_dataset[\"train\"],\n",
        ")\n",
        "model.config.use_cache = False"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-gkaX7rLAuZE"
      },
      "source": [
        "Start the training."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "qbgb7jBvf1lQ"
      },
      "outputs": [],
      "source": [
        "trainer.train()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "EbQniYS5k_19"
      },
      "source": [
        "Save the fine-tuned model to disk."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "z31TUUcQlB5D"
      },
      "outputs": [],
      "source": [
        "lora_model_id=\"flan_t5_lora\"\n",
        "trainer.model.save_pretrained(lora_model_id)\n",
        "tokenizer.save_pretrained(lora_model_id)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "myEs2ZKJjVJ_"
      },
      "source": [
        "### Inference and Evaluation"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DDNUhBGxA6r0"
      },
      "source": [
        "Prepare the model for inference. Load the LoRA configuration and checkpoints, reload the base model and merge the weights."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Q3IQc4aHf5Ro"
      },
      "outputs": [],
      "source": [
        "import torch\n",
        "from peft import PeftModel, PeftConfig\n",
        "from transformers import AutoModelForSeq2SeqLM, AutoTokenizer\n",
        "\n",
        "config = PeftConfig.from_pretrained(lora_model_id)\n",
        "\n",
        "model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path,  load_in_8bit=True,  device_map={\"\":0})\n",
        "tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)\n",
        "\n",
        "model = PeftModel.from_pretrained(model, lora_model_id, device_map={\"\":0})\n",
        "model.eval()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "97pexBJnCupd"
      },
      "source": [
        "Perform inference (text summarization) on a random subset of the test samples."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "kfRyaIonmMSx"
      },
      "outputs": [],
      "source": [
        "from random import randrange\n",
        "from datasets import load_dataset\n",
        "\n",
        "sample = dataset['test'][randrange(len(dataset[\"test\"]))]\n",
        "\n",
        "input_ids = tokenizer(sample[\"dialogue\"], return_tensors=\"pt\",\n",
        "                      truncation=True).input_ids.cuda()\n",
        "outputs = model.generate(input_ids=input_ids, max_new_tokens=10,\n",
        "                         do_sample=True, top_p=0.9)\n",
        "print(f\"input sentence: {sample['dialogue']}\\n{'---'* 20}\")\n",
        "\n",
        "print(f\"summary:\\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0]}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "sXcrllW6DZnt"
      },
      "source": [
        "Define a function to evaluate the model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "2FQnVciODdvQ"
      },
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "\n",
        "def evaluate_peft_model(sample, max_target_length=50):\n",
        "    outputs = model.generate(input_ids=sample[\"input_ids\"].unsqueeze(0).cuda(),\n",
        "                             do_sample=True, top_p=0.9,\n",
        "                             max_new_tokens=max_target_length)\n",
        "    prediction = tokenizer.decode(\n",
        "        outputs[0].detach().cpu().numpy(), skip_special_tokens=True)\n",
        "    labels = np.where(sample['labels'] != -100,\n",
        "                      sample['labels'], tokenizer.pad_token_id)\n",
        "    labels = tokenizer.decode(labels, skip_special_tokens=True)\n",
        "\n",
        "    return prediction, labels"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-w82w-gKDKYt"
      },
      "source": [
        "Evaluate the model ([*ROUGE* score](https://huggingface.co/spaces/evaluate-metric/rouge)) on the test dataset."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "CbomJQmBmNWA"
      },
      "outputs": [],
      "source": [
        "import evaluate\n",
        "from datasets import load_from_disk\n",
        "from tqdm import tqdm\n",
        "\n",
        "metric = evaluate.load(\"rouge\")\n",
        "\n",
        "test_dataset = load_from_disk(\"data/eval/\").with_format(\"torch\")\n",
        "\n",
        "predictions, references = [] , []\n",
        "for sample in tqdm(test_dataset):\n",
        "    p,l = evaluate_peft_model(sample)\n",
        "    predictions.append(p)\n",
        "    references.append(l)\n",
        "\n",
        "rogue = metric.compute(predictions=predictions,\n",
        "                       references=references,\n",
        "                       use_stemmer=True)\n",
        "\n",
        "print(f\"Rogue1: {rogue['rouge1']* 100:2f}%\")\n",
        "print(f\"rouge2: {rogue['rouge2']* 100:2f}%\")\n",
        "print(f\"rougeL: {rogue['rougeL']* 100:2f}%\")\n",
        "print(f\"rougeLsum: {rogue['rougeLsum']* 100:2f}%\")"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "gpuType": "T4",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}